Multi-agent Q-Learning of Channel Selection in 
Multi-user Cognitive Radio Systems: 
A Two by Two Case

Abstract: Resource allocation is an important issue in cognitive
radio systems. It can be done by carrying out negotiation
among secondary users. However, significant overhead may be
incurred by the negotiation, since it needs to be done
frequently due to the rapid change of primary users' activity.
In this paper, a channel selection scheme without negotiation
is considered for multi-user and multi-channel cognitive radio
systems. To avoid collisions incurred by non-coordination, each
secondary user learns how to select channels according to its
experience. Multi-agent reinforcement learning (MARL) is applied
in the framework of Q-learning by considering the opponent
secondary users as a part of the environment. The dynamics
of the Q-learning are illustrated using a Metrick-Polak plot. A
rigorous proof of the convergence of Q-learning is provided via
the similarity between the Q-learning and the Robbins-Monro
algorithm, as well as the analysis of convergence of the corresponding
ordinary differential equation (via a Lyapunov function). Examples
are illustrated and the performance of learning is evaluated by
numerical simulations.



I. Introduction 

In recent years, cognitive radio has been studied extensively
in the wireless communications community. It
allows users without a license (called secondary users) to access
licensed frequency bands when the licensed users (called
primary users) are not present. Therefore, the cognitive radio
technique can substantially alleviate the problem of
under-utilization of the frequency spectrum [11] [10].

The following two problems are key to cognitive radio
systems:

• Resource mining, i.e. how to detect the available resource
(the frequency bands that are not being used by primary
users); usually this is done by spectrum sensing.

• Resource allocation, i.e. how to allocate the detected
available resource to different secondary users.

Substantial work has been done for the resource mining. 
Many signal processing techniques have been applied to sense 
the frequency spectrum [17], e.g. cyclostationary feature [5], 
quickest change detection [8], collaborative spectrum sensing 
[3]. Meanwhile, much research has been conducted on
resource allocation in cognitive radio systems [12] [6].
Typically, it is assumed that the secondary users exchange
information about detected available spectrum resources and
then negotiate the resource allocation according to their own
traffic requirements (since the same resource cannot be
shared by different secondary users if orthogonal transmission
is assumed). These studies typically apply theories from
economics, e.g. game theory, bargaining theory or microeconomics.

Fig. 1: Illustration of competition and conflict in multi-user
and multi-channel cognitive radio systems.

However, in many applications of cognitive radio, such a 
negotiation based resource allocation may incur significant 
overhead. In traditional wireless communication systems, the 
available resource is almost fixed (even if we consider the 
fluctuation of channel quality, the change of available resource 
is still very slow and thus can be considered stationary). There- 
fore, the negotiation need not be carried out frequently and 
the negotiation result can be applied for a long period of data 
communication, thus incurring tolerable overhead. However, 
in many cognitive radio systems, the resource may change 
very rapidly since the activity of primary users may be highly 
dynamic. Therefore, the available resource needs to be updated 
very frequently and the data communication period should be 
fairly short since minimum violation to primary users should 
be guaranteed. In such a situation, the negotiation of resource 
allocation may be highly inefficient since a substantial portion 
of time needs to be used for the negotiation. To alleviate such 
an inefficiency, high speed transceivers need to be used to 
minimize the time consumed on negotiation. Particularly, the 
turn-around time, i.e. the time needed to switch from receiving 
(transmitting) to transmitting (receiving) should be very small, 
which is a substantial challenge to hardware design. 

Motivated by the previous discussion and observation, in
this paper we study the problem of spectrum access without
negotiation in multi-user and multi-channel cognitive radio
systems. In such a scheme, each secondary user senses channels
and then chooses an idle frequency channel to transmit
data, as if no other secondary user exists. If two secondary
users choose the same channel for data transmission, they
will collide with each other and the data packets cannot be
decoded by the receiver(s). Such a procedure is illustrated in
Fig. 1, where three secondary users access an access point via
four channels. Since there is no mutual communication among
these secondary users, conflict is unavoidable. However, the
secondary users can try to learn how to avoid each other,
as well as the channel qualities (we assume that the secondary
users have no a priori information about the channel qualities),
according to their experience. In such a context, the cognition
procedure covers not only the frequency spectrum but also
the behavior of other secondary users.

To accomplish the task of learning channel selection, multi- 
agent reinforcement learning (MARL) [1] is a powerful tool. 
One challenge of MARL in our context is that the secondary 
users do not know the payoffs (thus the strategy) of each 
other in each stage; thus the environment of each secondary 
user, including its opponents, is dynamic and may not assure 
convergence of learning. In such a situation, fictitious play
[2] [14], which estimates other users' strategy and plays the
best response, can assure convergence to a Nash equilibrium
point under certain assumptions. As an alternative, we
adopt the principle of Q-learning, i.e. evaluating the values
of different actions in an incremental way. For simplicity,
we consider only the case of two secondary users and two
channels. By applying the theory of stochastic approximation
[7], we will prove the main result of this paper, i.e. the
learning converges to a stationary point regardless of the
initial strategies (Propositions 1 and 2). Note that our study
is one extreme of the resource allocation problem since no 
negotiation is considered while the other extreme is full 
negotiation to achieve optimal performance. It is interesting 
to study the intermediate case, i.e. limited negotiation for 
resource allocation. However, it is beyond the scope of this 
paper. 

The remainder of this paper is organized as follows. In
Section II, the system model is introduced. The proposed
Q-learning for channel selection is explained in Section III.
An intuitive explanation and a rigorous proof of convergence are
given in Sections IV and V, respectively. The numerical
results are provided in Section VI, while the conclusions are
drawn in Section VII.

II. System Model

For simplicity, we consider only two secondary users,
denoted by A and B, and two channels, denoted by 1 and
2. The reward to secondary user $i$, $i = A, B$, of channel $j$,
$j = 1, 2$, is $R_{ij}$ if secondary user $i$ transmits data over channel
$j$ and channel $j$ is not interrupted by a primary user or the other
secondary user; otherwise the reward is 0 since the secondary
user cannot convey any information over this channel. For
simplicity, we denote by $j^-$ the other user (channel), i.e. the one
different from user (channel) $j$.

The following assumptions are made throughout this paper.

• The rewards $\{R_{ij}\}$ are unknown to both secondary users.
They are fixed throughout the game.

• Both secondary users can sense both channels simultaneously,
but can choose only one channel for data
transmission. It is more interesting and challenging to
study the case in which the secondary users can sense only
one channel, thus forming a partially observable game.
However, it is beyond the scope of this paper.

• We consider only the case in which both channels are available,
since otherwise the actions that the secondary users can take are
obvious (transmit over the only available channel, or do not
transmit if no channel is available). Thus, we ignore the
task of sensing the frequency spectrum, which has been
well studied by many researchers, and focus only on the
cognition of the other secondary user's behavior.

• There is no communication between the two secondary
users.

III. Game and Q-learning

In this section, we introduce the corresponding game and
the application of Q-learning to the channel selection problem.

A. Game of Channel Selection

The channel selection problem is a 2 x 2 game, in which
the payoff matrices are given in Fig. 2. Note that the actions,
denoted by $a_i(t)$ for user $i$ at time $t$, in the game are the
selections of channels. Obviously, the diagonal elements in the
payoff matrices are all zero since conflict incurs zero reward.

It is easy to verify that there are two pure-strategy Nash equilibrium
points in the game, i.e. strategy pairs from which unilaterally
changing strategy degrades the deviating user's own performance:
$a_A = 1, a_B = 2$ and $a_A = 2, a_B = 1$ (orthogonal transmission).

Fig. 2: Payoff matrices in the game of channel selection.
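The pure-strategy equilibria of the channel selection game can be checked by brute force. The following is a minimal sketch under stated assumptions: the reward values in `R` are hypothetical (the paper only assumes fixed, unknown rewards), and the helper names `payoff` and `pure_nash_equilibria` are introduced here for illustration.

```python
from itertools import product

# Hypothetical rewards R[user][channel]; any positive values behave the same way.
R = {"A": {1: 1.0, 2: 0.8}, "B": {1: 0.6, 2: 0.9}}

def payoff(user, a_A, a_B):
    """Reward of `user` when A picks channel a_A and B picks channel a_B."""
    if a_A == a_B:
        return 0.0  # collision: neither user conveys any information
    own_channel = a_A if user == "A" else a_B
    return R[user][own_channel]

def pure_nash_equilibria():
    """List the pure action pairs from which no user gains by deviating alone."""
    equilibria = []
    for a_A, a_B in product((1, 2), repeat=2):
        alt_A, alt_B = 3 - a_A, 3 - a_B  # the other channel for each user
        a_stays = payoff("A", a_A, a_B) >= payoff("A", alt_A, a_B)
        b_stays = payoff("B", a_A, a_B) >= payoff("B", a_A, alt_B)
        if a_stays and b_stays:
            equilibria.append((a_A, a_B))
    return equilibria

print(pure_nash_equilibria())
```

With positive rewards, only the two orthogonal assignments survive the deviation test, which matches the two pure equilibria named in the text.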



B. Q-function 

Since we assume that both channels are available, there
is only one state in the system. Therefore, the Q-function
is simply the expected reward of each action (note that, in
traditional learning in a stochastic environment, the Q-function
is defined over state-action pairs), i.e.

$$Q(a) = E[R(a)], \tag{1}$$

where $a$ is the action, $R$ is the reward dependent on the action
and the expectation is over the randomness of the other user's
action. Since the action is the selection of a channel, we denote
by $Q_{ij}$ the value of selecting channel $j$ by secondary user $i$.

C. Exploration 

In contrast to fictitious play [2], which is deterministic, the
action in Q-learning is stochastic to assure that all actions
will be tested. We consider the Boltzmann distribution for random
exploration, i.e.

$$P(\text{user } i \text{ chooses channel } j) = \frac{e^{Q_{ij}/\gamma}}{e^{Q_{ij}/\gamma} + e^{Q_{ij^-}/\gamma}}, \tag{2}$$

where $\gamma$ is called the temperature, which controls the frequency
of exploration.
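The effect of the temperature on the Boltzmann rule (2) is easy to see numerically. The sketch below is illustrative only; the function name `boltzmann_prob` and the sample Q values are assumptions, not taken from the paper.

```python
import math

def boltzmann_prob(q_j, q_other, gamma):
    """Probability of choosing the channel valued q_j over the one valued q_other."""
    # Subtracting the maximum before exponentiating avoids overflow for small gamma.
    m = max(q_j, q_other)
    e_j = math.exp((q_j - m) / gamma)
    e_other = math.exp((q_other - m) / gamma)
    return e_j / (e_j + e_other)

# The same small gap in Q values gives a nearly uniform choice at high
# temperature and a nearly greedy choice at low temperature.
for gamma in (1.0, 0.1, 0.01):
    print(gamma, boltzmann_prob(0.6, 0.5, gamma))
```

A large temperature keeps both channels in play (more exploration), while a small temperature concentrates the choice on the channel with the larger Q value.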

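Obviously, when secondary user $i$ selects channel $j$, the
expected reward is given by

$$E[R_i(j)] = \frac{R_{ij}\, e^{Q_{i^-j^-}/\gamma}}{e^{Q_{i^-j}/\gamma} + e^{Q_{i^-j^-}/\gamma}}, \tag{3}$$

since secondary user $i^-$ chooses channel $j$ with probability
$\frac{e^{Q_{i^-j}/\gamma}}{e^{Q_{i^-j}/\gamma} + e^{Q_{i^-j^-}/\gamma}}$
(collision happens and secondary user $i$ receives no reward)
and channel $j^-$ with probability
$\frac{e^{Q_{i^-j^-}/\gamma}}{e^{Q_{i^-j}/\gamma} + e^{Q_{i^-j^-}/\gamma}}$
(the transmissions are orthogonal and secondary
user $i$ receives reward $R_{ij}$).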
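The expected reward (3) is the reward $R_{ij}$ weighted by the probability that the opponent picks the other channel. A minimal sketch, with function and argument names chosen here for illustration:

```python
import math

def choice_prob(q_chosen, q_alt, gamma):
    """Boltzmann probability of picking the channel valued q_chosen."""
    m = max(q_chosen, q_alt)
    a = math.exp((q_chosen - m) / gamma)
    b = math.exp((q_alt - m) / gamma)
    return a / (a + b)

def expected_reward(R_ij, q_opp_same, q_opp_other, gamma):
    """Expected reward of user i on channel j.

    q_opp_same  : opponent's Q value for the same channel j
    q_opp_other : opponent's Q value for the other channel
    """
    # The reward is collected only if the opponent goes to the other channel.
    return R_ij * choice_prob(q_opp_other, q_opp_same, gamma)

# An indifferent opponent collides half of the time.
print(expected_reward(1.0, 0.5, 0.5, 0.1))
```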



D. Updating Q-Functions 

In the procedure of Q-learning, the Q-functions are updated
after each spectrum access via the following rule:

$$Q_{ij}(t+1) = (1 - \alpha_{ij}(t))\, Q_{ij}(t) + \alpha_{ij}(t)\, r_i(t)\, I(a_i(t) = j), \tag{4}$$

where $\alpha_{ij}(t)$ is a step factor (when channel $j$ is not selected by
user $i$, $\alpha_{ij}(t) = 0$), $r_i(t)$ is the reward of secondary user $i$
and $I$ is the characteristic function of the event that channel $j$ is
selected at the $t$-th spectrum access. Our study is focused on
the dynamics of (4). To assure convergence, we assume that

$$\sum_{t} \alpha_{ij}(t) = \infty, \quad \forall i = A, B, \; j = 1, 2. \tag{5}$$
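The updating rule (4), combined with Boltzmann exploration, can be simulated in a few lines. This is a sketch under stated assumptions, not the authors' simulation code: the step factor `alpha0 / visits` (per user and channel) is an assumed schedule whose sum diverges, and the rewards, seed and function names are hypothetical.

```python
import math
import random

def choose(q, gamma, rng):
    """Sample channel 1 or 2 from the Boltzmann distribution over q = {1: Q_i1, 2: Q_i2}."""
    m = max(q.values())
    w1 = math.exp((q[1] - m) / gamma)
    w2 = math.exp((q[2] - m) / gamma)
    return 1 if rng.random() < w1 / (w1 + w2) else 2

def simulate(R, gamma=0.1, alpha0=0.5, steps=2000, seed=0):
    """Run the two-user, two-channel learning and return the final Q values."""
    rng = random.Random(seed)
    Q = {u: {1: rng.random(), 2: rng.random()} for u in ("A", "B")}
    visits = {u: {1: 0, 2: 0} for u in ("A", "B")}
    for _ in range(steps):
        act = {u: choose(Q[u], gamma, rng) for u in ("A", "B")}
        collided = act["A"] == act["B"]
        for u in ("A", "B"):
            j = act[u]
            reward = 0.0 if collided else R[u][j]
            visits[u][j] += 1
            alpha = alpha0 / visits[u][j]  # assumed schedule, not from the paper
            # Only the selected channel is updated; the other keeps its value.
            Q[u][j] = (1 - alpha) * Q[u][j] + alpha * reward
    return Q

R = {"A": {1: 1.0, 2: 0.8}, "B": {1: 0.6, 2: 0.9}}  # hypothetical rewards
print(simulate(R))
```

Each user only observes its own reward, so the opponent enters the update purely through collisions, which is the sense in which it is treated as part of the environment.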



IV. Intuition on Convergence 

As will be shown in Propositions 1 and 2, the updating rule
of the Q-functions in (4) will converge to a stationary equilibrium
point close to a Nash equilibrium if the step factor satisfies
certain conditions. Before the rigorous proof, we provide an 
intuitive explanation for the convergence using the geometric 
argument proposed in [9]. 

The intuitive explanation is provided in Fig. 3 (we call
it a Metrick-Polak plot since it was originally proposed by A.
Metrick and B. Polak in [9]), where the axes are
$\mu_A = Q_{A1}/Q_{A2}$ and $\mu_B = Q_{B1}/Q_{B2}$, respectively. As labeled in the figure, the
plane is divided into four regions by the two lines $\mu_A = 1$ and
$\mu_B = 1$, in which the dynamics of Q-learning are different.
We discuss these four regions separately:

• Region I: in this region, $Q_{A1} > Q_{A2}$; therefore, secondary
user A prefers visiting channel 1; meanwhile,
secondary user B prefers accessing channel 2 since
$Q_{B1} < Q_{B2}$. Then, with large probability, the strategies
will converge to the Nash equilibrium point in which
secondary users A and B access channels 1 and 2,
respectively.

• Region II: in this region, both secondary users prefer
accessing channel 1, thus causing many collisions. Therefore,
both $Q_{A1}$ and $Q_{B1}$ will be reduced until entering
either Region I or Region III.

• Region III: similar to Region I.

• Region IV: similar to Region II.

Then, we observe that the points in Regions II and IV
are unstable and will move into Region I or III with large
probability. In Regions I and III, the strategy will move close to
the Nash equilibrium points with large probability. Therefore,
regardless of where the initial point is, the updating rule in (4)
will generate a stationary equilibrium point with large
probability.

Fig. 3: Illustration of the dynamics in the Q-learning.
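The four regions of the plot can be expressed as a simple classification of the Q values. The sketch below is illustrative; the function name `region` and the tie-breaking rule (a tie counts as preferring channel 2) are assumptions made here, not part of the paper.

```python
def region(Q_A1, Q_A2, Q_B1, Q_B2):
    """Return the region label (I to IV) for a given set of Q values."""
    a_prefers_1 = Q_A1 > Q_A2
    b_prefers_1 = Q_B1 > Q_B2
    if a_prefers_1 and not b_prefers_1:
        return "I"    # A on channel 1, B on channel 2: orthogonal, stable
    if a_prefers_1 and b_prefers_1:
        return "II"   # both prefer channel 1: frequent collisions, unstable
    if not a_prefers_1 and b_prefers_1:
        return "III"  # A on channel 2, B on channel 1: orthogonal, stable
    return "IV"       # both prefer channel 2: frequent collisions, unstable

print(region(0.9, 0.1, 0.2, 0.7))
print(region(0.9, 0.1, 0.8, 0.3))
```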

V. Stochastic Approximation Based Convergence 

In this section, we prove the convergence of the Q-learning.
First, we find the equivalence between the updating rule (4)
and the Robbins-Monro iteration [13] for solving an equation
with unknown expression. Then, we apply the conclusion in
stochastic approximation [7] to relate the dynamics of the 
updating rule to an ordinary differential equation (ODE) and 
prove the stability of the ODE. 



A. Robbins-Monro Iteration 

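At a stationary point, the expected values of the Q-functions
satisfy the following four equations:

$$Q_{ij} = \frac{R_{ij}\, e^{Q_{i^-j^-}/\gamma}}{e^{Q_{i^-j}/\gamma} + e^{Q_{i^-j^-}/\gamma}}, \quad i = A, B, \; j = 1, 2. \tag{6}$$

Define $q = (Q_{A1}, Q_{A2}, Q_{B1}, Q_{B2})^T$. Then (6) can be
rewritten as

$$g(q) = A(q) r - q = 0, \tag{7}$$

where $r = (R_{A1}, R_{A2}, R_{B1}, R_{B2})^T$ and the matrix $A$ (as a
function of $q$) is diagonal. Indexing its rows and columns by the
user-channel pairs in the same order as the elements of $q$, we have

$$A_{(ij),(kl)} = \begin{cases} \dfrac{e^{Q_{i^-j^-}/\gamma}}{e^{Q_{i^-j}/\gamma} + e^{Q_{i^-j^-}/\gamma}}, & \text{if } (i,j) = (k,l), \\ 0, & \text{otherwise}. \end{cases} \tag{8}$$

Then, the updating rule in (4) is equivalent to solving the
equation (7) (the expression of the equation is unknown since
the rewards, as well as the strategy of the other user, are
unknown) using the Robbins-Monro algorithm [7], i.e.

$$q(t+1) = q(t) + \alpha(t) Y(t), \tag{9}$$

where $Y(t)$ is a random observation of the function $g$ contaminated
by noise, i.e.

$$\begin{aligned} Y(t) &= r(t) - q(t) \\ &= \bar{r}(t) - q(t) + r(t) - \bar{r}(t) \\ &= g_t(q(t)) + \delta M(t), \end{aligned} \tag{10}$$

where $g_t(q(t)) = \bar{r}(t) - q(t)$, $\delta M(t) = r(t) - \bar{r}(t)$ is noise
(recall that $r_i(t)$ means the reward of secondary user $i$ at time
$t$) and

$$\bar{r}(t) = A(q(t))\, r. \tag{11}$$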
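The mean field $g(q) = A(q) r - q$ is straightforward to evaluate once the diagonal of $A(q)$ is written out. A minimal sketch, in which the function names and the test values are assumptions chosen for illustration:

```python
import math

def prob_other_channel(q_same, q_other, gamma):
    """Boltzmann probability that a user picks the channel valued q_other."""
    m = max(q_same, q_other)
    a = math.exp((q_same - m) / gamma)
    b = math.exp((q_other - m) / gamma)
    return b / (a + b)

def g(q, r, gamma):
    """Evaluate g(q) = A(q) r - q for q = (Q_A1, Q_A2, Q_B1, Q_B2)."""
    Q_A1, Q_A2, Q_B1, Q_B2 = q
    diag_A = (
        prob_other_channel(Q_B1, Q_B2, gamma),  # A on channel 1 succeeds if B picks 2
        prob_other_channel(Q_B2, Q_B1, gamma),  # A on channel 2 succeeds if B picks 1
        prob_other_channel(Q_A1, Q_A2, gamma),  # B on channel 1 succeeds if A picks 2
        prob_other_channel(Q_A2, Q_A1, gamma),  # B on channel 2 succeeds if A picks 1
    )
    return tuple(a * r_k - q_k for a, r_k, q_k in zip(diag_A, r, q))

# With unit rewards, q = (1, 0, 0, 1) is numerically a root at low temperature.
print(g((1.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0), 0.01))
```

Note that with equal rewards the fully symmetric point (all Q values equal to half the reward) also satisfies $g(q) = 0$, so a root of (7) is not necessarily one of the orthogonal allocations.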



B. ODE and Convergence

The procedure of using the Robbins-Monro algorithm (i.e. the
updating of the Q-functions) is a stochastic approximation of the
solution of the equation. It is well known that the convergence
of such a procedure can be characterized by an ODE. Since the
noise $\delta M(t)$ in (10) is a martingale difference, we can verify
the conditions in Theorem 12.3.5 in [7] (the verification is
omitted due to the limited length of this paper) and obtain the
following proposition:

Proposition 1: With probability 1, the sequence $q(t)$ converges
to some limit set of the ODE

$$\dot{q} = g(q). \tag{12}$$

What remains is to analyze the convergence property
of the ODE (12). We obtain the following proposition:

Proposition 2: The solution of the ODE (12) converges to the
stationary point determined by (6).
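The behavior of the ODE $\dot{q} = g(q)$ can be explored with a forward Euler integration. The sketch below is only a numerical illustration for one hypothetical starting point and reward vector; the step size `dt`, the horizon and all names are assumptions, and it is not a substitute for the proof.

```python
import math

def p_pick(q_target, q_alt, gamma):
    """Boltzmann probability of picking the channel valued q_target."""
    m = max(q_target, q_alt)
    a = math.exp((q_target - m) / gamma)
    b = math.exp((q_alt - m) / gamma)
    return a / (a + b)

def g(q, r, gamma):
    """Mean field A(q) r - q for q = (Q_A1, Q_A2, Q_B1, Q_B2)."""
    Q_A1, Q_A2, Q_B1, Q_B2 = q
    success = (
        p_pick(Q_B2, Q_B1, gamma),  # B picks channel 2, so A is alone on channel 1
        p_pick(Q_B1, Q_B2, gamma),  # B picks channel 1, so A is alone on channel 2
        p_pick(Q_A2, Q_A1, gamma),  # A picks channel 2, so B is alone on channel 1
        p_pick(Q_A1, Q_A2, gamma),  # A picks channel 1, so B is alone on channel 2
    )
    return tuple(s * r_k - q_k for s, r_k, q_k in zip(success, r, q))

def euler_solve(q0, r, gamma, dt=0.01, steps=5000):
    """Integrate dq/dt = g(q) with the forward Euler method."""
    q = tuple(q0)
    for _ in range(steps):
        dq = g(q, r, gamma)
        q = tuple(q_k + dt * d_k for q_k, d_k in zip(q, dq))
    return q

r = (1.0, 0.8, 0.6, 0.9)    # hypothetical rewards (R_A1, R_A2, R_B1, R_B2)
q0 = (0.9, 0.1, 0.2, 0.7)   # hypothetical start: A leans to channel 1, B to channel 2
q_end = euler_solve(q0, r, gamma=0.1)
print(q_end, g(q_end, r, 0.1))
```

For this starting point the trajectory settles where user A keeps channel 1 and user B keeps channel 2, with $g(q)$ numerically zero at the end point.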

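Proof: We apply Lyapunov's method to analyze the convergence
of the ODE (12). We define the Lyapunov function
as

$$V(t) = \|g(t)\|^2 = \sum_{ij} \left(\bar{r}_{ij}(t) - Q_{ij}(t)\right)^2. \tag{13}$$

Then, we examine the derivative of the Lyapunov function
with respect to time $t$, i.e.

$$\frac{dV(t)}{dt} = 2\sum_{ij} \frac{d\left(\bar{r}_{ij}(t) - Q_{ij}(t)\right)}{dt} \left(\bar{r}_{ij}(t) - Q_{ij}(t)\right) = 2\sum_{ij} \frac{d\epsilon_{ij}(t)}{dt}\, \epsilon_{ij}(t), \tag{14}$$

where $\epsilon_{ij}(t) = \bar{r}_{ij}(t) - Q_{ij}(t)$. We have

$$\frac{d\epsilon_{ij}(t)}{dt} = \frac{d\bar{r}_{ij}(t)}{dt} - \frac{dQ_{ij}(t)}{dt} = \frac{d\bar{r}_{ij}(t)}{dt} - \epsilon_{ij}(t), \tag{15}$$

where $\bar{r}_{ij}(t) = (Ar)_{ij}$ and we applied the ODE (12).

Then, we focus on the computation of $d\bar{r}_{ij}(t)/dt$. When $i = A$
and $j = 1$, we have

$$\begin{aligned} \frac{d\bar{r}_{A1}(t)}{dt} &= \frac{d}{dt}\left( \frac{R_{A1}\, e^{Q_{B2}/\gamma}}{e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}} \right) \\ &= \frac{R_{A1}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} \left( \frac{dQ_{B2}(t)}{dt} - \frac{dQ_{B1}(t)}{dt} \right) \\ &= \frac{R_{A1}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} \left( \epsilon_{B2}(t) - \epsilon_{B1}(t) \right), \end{aligned} \tag{16}$$

where we applied the ODE (12) again.
Using similar arguments, we have

$$\frac{d\bar{r}_{A2}(t)}{dt} = \frac{R_{A2}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} \left( \epsilon_{B1}(t) - \epsilon_{B2}(t) \right), \tag{17}$$

and

$$\frac{d\bar{r}_{B1}(t)}{dt} = \frac{R_{B1}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2} \left( \epsilon_{A2}(t) - \epsilon_{A1}(t) \right), \tag{18}$$

and

$$\frac{d\bar{r}_{B2}(t)}{dt} = \frac{R_{B2}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2} \left( \epsilon_{A1}(t) - \epsilon_{A2}(t) \right). \tag{19}$$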
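Combining the above results, we have

$$\frac{1}{2}\frac{dV(t)}{dt} = -\sum_{ij} \epsilon_{ij}^2(t) + C_{12}\,\epsilon_{A1}\epsilon_{B2} - C_{11}\,\epsilon_{A1}\epsilon_{B1} + C_{21}\,\epsilon_{A2}\epsilon_{B1} - C_{22}\,\epsilon_{A2}\epsilon_{B2}, \tag{20}$$

where

$$C_{12} = \frac{R_{A1}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} + \frac{R_{B2}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2}, \tag{21}$$

and

$$C_{11} = \frac{R_{A1}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} + \frac{R_{B1}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2}, \tag{22}$$

and

$$C_{21} = \frac{R_{A2}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} + \frac{R_{B1}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2}, \tag{23}$$

and

$$C_{22} = \frac{R_{A2}\, e^{Q_{B1}/\gamma} e^{Q_{B2}/\gamma}}{\gamma \left(e^{Q_{B1}/\gamma} + e^{Q_{B2}/\gamma}\right)^2} + \frac{R_{B2}\, e^{Q_{A1}/\gamma} e^{Q_{A2}/\gamma}}{\gamma \left(e^{Q_{A1}/\gamma} + e^{Q_{A2}/\gamma}\right)^2}. \tag{24}$$

Fig. 4: An example of dynamics of the Q-learning.

Fig. 5: An example of dynamics of the Q-learning.

It is easy to verify that

$$\frac{e^{Q_{i1}/\gamma} e^{Q_{i2}/\gamma}}{\left(e^{Q_{i1}/\gamma} + e^{Q_{i2}/\gamma}\right)^2} < 1. \tag{25}$$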
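The bound (25) can be seen directly by noting that the ratio equals $p(1-p)$, where $p$ is the Boltzmann probability of choosing one of the two channels, so it never exceeds 1/4. A small numerical sketch (the function name and the grid of test values are assumptions made here):

```python
import math

def ratio(q1, q2, gamma):
    """e^{q1/gamma} e^{q2/gamma} / (e^{q1/gamma} + e^{q2/gamma})^2, written as p * (1 - p)."""
    # The ratio is symmetric in q1 and q2, so the absolute gap can be used;
    # a non-positive exponent keeps math.exp from overflowing.
    p = 1.0 / (1.0 + math.exp(-abs(q1 - q2) / gamma))
    return p * (1.0 - p)

grid = [k / 10 for k in range(11)]
worst = max(ratio(a, b, g) for a in grid for b in grid for g in (1.0, 0.1, 0.01))
print(worst)  # the maximum is attained when the two Q values are equal
```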



Now, we assume that $R_{ij} < 2\gamma$; then $C_{ij} < 2$. Therefore,
we have

$$\frac{1}{2}\frac{dV(t)}{dt} = -(\epsilon_{A1} - \epsilon_{B2})^2 - (\epsilon_{A2} - \epsilon_{B1})^2 < 0. \tag{26}$$

Therefore, when $R_{ij} < 2\gamma$, the derivative of the Lyapunov
function is strictly negative, which implies that the ODE (12)
converges to a stationary point.

The final step of the proof is to remove the condition
$R_{ij} < 2\gamma$. This is straightforward since we notice that the
convergence is independent of the scale of the reward $R_{ij}$.
Therefore, we can always scale the reward such that $R_{ij} < 2\gamma$.
This concludes the proof.



VI. Numerical Results 

In this section, we use numerical simulations to demonstrate
the theoretical results obtained in the previous sections. For all
simulations, we use a step factor $\alpha_{ij}(t)$ parameterized by
$\alpha_0$, the initial learning factor.



A. Dynamics 

Figures 4 and 5 show the dynamics of the Q-values for
several typical trajectories. Note that $\gamma = 0.1$ in Fig. 4 and
$\gamma = 0.01$ in Fig. 5. We observe that the trajectories move from
unstable regions (II and IV in Fig. 3) to stable regions (I and
III in Fig. 3). We also observe that the trajectories for the smaller
temperature $\gamma$ are smoother since fewer explorations are carried
out.

Fig. 6 shows the evolution of the probability of choosing
channel 1 when $\gamma = 0.1$. We observe that both secondary users
prefer channel 1 at the beginning and soon secondary user A
tends to choose channel 2, thus avoiding the collision.




Fig. 6: An example of the evolution of channel selection
probability (horizontal axis: time).



B. Learning Speed 

Figures 7 and 8 show the delays of learning (equivalently,
the learning speed) for different learning factors $\alpha_0$ and
different temperatures $\gamma$, respectively. The initial Q values are
randomly selected. When the probabilities of choosing channel
1 are larger than 0.95 for one secondary user and smaller than
0.05 for the other secondary user, we claim that the learning
procedure is completed. We observe that a larger learning factor
$\alpha_0$ results in a smaller delay, while a smaller $\gamma$ yields a faster
learning procedure.
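The completion criterion above translates directly into a stopping test on the two channel-1 probabilities. The sketch below is illustrative; the function names and the shape of `history` (one pair of probabilities per spectrum access) are assumptions introduced here.

```python
def learning_completed(p_A1, p_B1, high=0.95, low=0.05):
    """True when one user picks channel 1 with probability above `high`
    and the other with probability below `low`."""
    return (p_A1 > high and p_B1 < low) or (p_B1 > high and p_A1 < low)

def learning_delay(history):
    """Index of the first access at which learning is completed, or None.

    history: sequence of (p_A1, p_B1) pairs, one per spectrum access.
    """
    for t, (p_A1, p_B1) in enumerate(history):
        if learning_completed(p_A1, p_B1):
            return t
    return None

# Hypothetical probability trace: both users start out preferring channel 1.
trace = [(0.7, 0.8), (0.5, 0.9), (0.1, 0.96), (0.03, 0.99)]
print(learning_delay(trace))
```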

C. Fluctuation 

In practical systems, we may not be able to use a vanishing
$\alpha_{ij}(t)$ since the environment could change (e.g. new secondary
users emerge or the channel qualities change). Therefore, we
need to set a lower bound for $\alpha_{ij}(t)$. Similarly, we also need
to set a lower bound for the probability of exploring all actions
(notice that the exploration probability in (2) can be arbitrarily
small). Fig. 9 shows that the learning procedure may yield
substantial fluctuation if the lower bounds are improperly chosen
(the lower bounds for $\alpha_{ij}(t)$ and the exploration probability are
set as 0.4 and 0.2 in Fig. 9).
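One simple way to impose the two lower bounds is to clip the step factor and the channel selection probability. This is a sketch under stated assumptions: the decaying schedule `alpha0 / n` and the clipping form of the exploration floor are choices made here for illustration, and the paper does not specify how its bounds are implemented.

```python
def bounded_step(alpha0, n, alpha_min):
    """Assumed decaying step alpha0 / n, floored at alpha_min so that
    the learner can still track a changing environment."""
    return max(alpha0 / n, alpha_min)

def floored_choice_prob(p, p_min):
    """Clip the probability of choosing channel 1 into [p_min, 1 - p_min],
    so that each channel is explored with probability at least p_min."""
    return min(max(p, p_min), 1.0 - p_min)

# With the bounds quoted for Fig. 9, the step never drops below 0.4 and
# each channel keeps at least a 0.2 chance of being selected.
print(bounded_step(1.0, 10, 0.4), floored_choice_prob(0.999, 0.2))
```

Large floors such as these keep the Q values reacting strongly to every collision, which is consistent with the fluctuation reported in the text.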
VII. Conclusions

We have discussed the 2 x 2 case of the learning procedure
for channel selection without negotiation in cognitive radio
systems. During the learning, each secondary user considers
the channel and the other secondary user as its environment,
updates its Q values and chooses its action accordingly. An intuitive
explanation for the convergence of learning is provided using
a Metrick-Polak plot. By applying the theory of stochastic
approximation and ODE, we have shown the convergence 
of learning under certain conditions. Numerical results show 
that the secondary users can learn to avoid collision quickly. 
However, if parameters are improperly chosen, the learning 
procedure may yield substantial fluctuation. 

Fig. 7: CDF of learning delay with different learning factor $\alpha_0$
(horizontal axis: delay).

Fig. 8: CDF of learning delay with different temperature $\gamma$
(curves for $\gamma$ = 0.1, 0.05 and 0.01; horizontal axis: delay).

Fig. 9: Fluctuation when improper lower bounds are selected
(horizontal axis: time).

References

[1] L. Buşoniu, R. Babuška and B. De Schutter, "A comprehensive survey
of multiagent reinforcement learning," IEEE Trans. Systems, Man and
Cybernetics, Part C, vol. 38, no. 2, pp. 156-172, March 2008.

[2] D. Fudenberg and D. K. Levine, The Theory of Learning in Games. The
MIT Press, Cambridge, MA, 1998.

[3] A. Ghasemi and E. S. Sousa, "Collaborative spectrum sensing for
opportunistic access in fading environment," in Proc. of IEEE International
Symposium of New Frontiers in Dynamic Spectrum Access Networks
(DySPAN), 2005.

[4] E. Hossain, D. Niyato and Z. Han, Dynamic Spectrum Access in
Cognitive Radio Networks. Cambridge University Press, UK, 2009.

[5] K. Kim, I. A. Akbar, K. K. Bae, et al., "Cyclostationary approaches to
signal detection and classification in cognitive radio," in Proc. of IEEE
International Symposium of New Frontiers in Dynamic Spectrum Access
Networks (DySPAN), April 2007.

[6] C. Kloeck, H. Jaekel and F. Jondral, "Multi-agent radio resource
allocation," Mobile Networks and Applications, vol. 11, no. 6, Dec. 2006.

[7] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive
Algorithms and Applications. Springer, 2003.

[8] H. Li, C. Li and H. Dai, "Quickest spectrum sensing in cognitive radio,"
in Proc. of the 42nd Conference on Information Sciences and Systems
(CISS), Princeton, NJ, 2008.

[9] A. Metrick and B. Polak, "Fictitious play in 2 x 2 games: A geometric
proof of convergence," Economic Theory, pp. 923-933, 1994.

[10] J. Mitola, "Cognitive radio for flexible mobile multimedia
communications," in Proc. IEEE Int. Workshop Mobile Multimedia
Communications, pp. 3-10, 1999.

[11] J. Mitola, "Cognitive Radio," Licentiate proposal, KTH, Stockholm,
Sweden.

[12] D. Niyato, E. Hossain and Z. Han, "Dynamics of multiple-seller and
multiple-buyer spectrum trading in cognitive radio networks: A game
theoretic modeling approach," to appear in IEEE Trans. Mobile Computing.

[13] H. Robbins and S. Monro, "A stochastic approximation method," The
Annals of Mathematical Statistics, vol. 22, pp. 400-407, 1951.

[14] J. Robinson, "An iterative method of solving a game," The Annals of
Mathematics, vol. 54, no. 2, pp. 296-301, 1951.

[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
The MIT Press, Cambridge, MA, 1998.

[16] C. Watkins, Learning From Delayed Rewards, PhD Thesis, The
University of Cambridge, UK, 1989.

[17] Q. Zhao and B. M. Sadler, "A survey of dynamic spectrum access,"
IEEE Signal Processing Magazine, vol. 24, pp. 79-89, May 2007.
